Statistical functions with missing values

Sum of Vector Elements (sum())

The following function returns the sum of the elements of a vector.

[[cpp11::register]] double sum2_cpp_(doubles x, bool na_rm = false) {
  int n = x.size();
  double total = 0;
  for (int i = 0; i < n; ++i) {
    if (na_rm && ISNAN(x[i])) {
      continue;
    } else {
      total += x[i];
    }
  }
  return total;
}

I also have to add the corresponding auxiliary function for the documentation.

#' Return the sum of the coordinates of a vector (C++)
#' @inheritParams sum_r
#' @param na_rm logical. Should missing values (including `NaN`) be removed?
#' @export
sum2_cpp <- function(x, na_rm = FALSE) {
  sum2_cpp_(as.double(x), na_rm = na_rm)
}

To test, I run the following lines in the R console.

library(bench)

set.seed(123) # for reproducibility
x <- runif(1e3) # 1,000,000 elements
x[sample(1:1e3, 1e2)] <- NA # change some elements to NA at random

sum(x, na.rm = FALSE)
[1] NA
sum2_cpp(x, na_rm = FALSE)
[1] NA
sum(x, na.rm = TRUE)
[1] 447.4191
sum2_cpp(x, na_rm = TRUE)
[1] 447.4191
mark(
  sum(x, na.rm = TRUE),
  sum2_cpp(x, na_rm = TRUE)
)
# A tibble: 2 × 6
  expression                     min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>                <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 sum(x, na.rm = TRUE)         777ns    815ns  1127489.        0B        0
2 sum2_cpp(x, na_rm = TRUE)   11.1µs   11.7µs    80548.        0B        0

Arithmetic Mean (mean())

This function returns the average (or mean) of the elements of a vector.

[[cpp11::register]] double mean2_cpp_(doubles x, bool na_rm = false) {
  int n = x.size();

  int m = 0;
  for (int i = 0; i < n; ++i) {
    if (na_rm && ISNAN(x[i])) {
      continue;
    } else {
      ++m;
    }
  }

  if (m == 0) {
    return NA_REAL;
  }

  double y = 0;
  for (int i = 0; i < n; ++i) {
    if (na_rm && ISNAN(x[i])) {
      continue;
    } else {
      y += x[i];
    }
  }

  return y / m;
}

I also need to add the corresponding auxiliary function for the documentation.

#' Return the mean of the coordinates of a vector (C++)
#' @inheritParams sum2_cpp
#' @export
mean2_cpp <- function(x, na_rm = FALSE) {
  mean2_cpp_(as.double(x), na_rm = na_rm)
}

A benchmark of the two functions is the following.

mean(x)
[1] NA
mean2_cpp(x)
[1] NA
mean(x, na.rm = TRUE)
[1] 0.4971323
mean2_cpp(x, na_rm = TRUE)
[1] 0.4971323
mark(
  mean(x, na.rm = TRUE),
  mean2_cpp(x, na_rm = TRUE)
)
# A tibble: 2 × 6
  expression                      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>                 <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 mean(x, na.rm = TRUE)        6.93µs   7.97µs   104027.    22.5KB     41.6
2 mean2_cpp(x, na_rm = TRUE)  18.11µs  18.96µs    51371.        0B      0  

Variance (var())

This function returns the variance of the elements of a vector.

[[cpp11::register]] double var2_cpp_(doubles x, bool na_rm = false) {
  int n = x.size();

  if (n < 2) {
    return NA_REAL;
  }

  int m = 0;
  for (int i = 0; i < n; ++i) {
    if (na_rm && ISNAN(x[i])) {
      continue;
    } else {
      ++m;
    }
  }

  if (m < 2) {
    return NA_REAL;
  }

  double ex = 0;
  for (int i = 0; i < n; ++i) {
    if (na_rm && ISNAN(x[i])) {
      continue;
    } else {
      ex += x[i];
    }
  }
  ex /= m;

  double out = 0;
  for (int i = 0; i < n; ++i) {
    if (na_rm && ISNAN(x[i])) {
      continue;
    } else {
      out += pow(x[i] - ex, 2);
    }
  }

  return out / (m - 1);
}

I also need to add the corresponding auxiliary function for the documentation.

#' Return the variance of the coordinates of a vector (C++)
#' @inheritParams sum2_cpp
#' @export
var2_cpp <- function(x, na_rm = FALSE) {
  var2_cpp_(as.double(x), na_rm = na_rm)
}

A benchmark of the two functions is the following.

var(x)
[1] NA
var2_cpp(x)
[1] NA
var(x, na.rm = TRUE)
[1] 0.08155043
var2_cpp(x, na_rm = TRUE)
[1] 0.08155043
mark(
  var(x, na.rm = TRUE),
  var2_cpp(x, na_rm = TRUE)
)
# A tibble: 2 × 6
  expression                     min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>                <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 var(x, na.rm = TRUE)        7.35µs   8.54µs   110455.    3.95KB     33.1
2 var2_cpp(x, na_rm = TRUE)  44.51µs  46.09µs    21216.        0B      0  

Root Mean Square Error

The next function returns the measure of the differences between the observed values or an estimator (x1, x2, …, xn) and the true value (x0). For example, x1, …, xn could be experimental averages and x0 the true average.

[[cpp11::register]] double rmse2_cpp_(doubles x, double x0) {
  int n = x.size();

  int m = 0;
  for (int i = 0; i < n; ++i) {
    if (ISNAN(x[i])) {
      continue;
    } else {
      ++m;
    }
  }

  if (m == 0) {
    return NA_REAL;
  }

  double out;
  for (int i = 0; i < n; ++i) {
    if (na_rm && ISNAN(x[i])) {
      continue;
    } else {
      out += pow(x[i] - x0, 2.0);
    }
  }
  return sqrt(out / m);
}

I also need to add the corresponding auxiliary function for the documentation.

#' Return the root mean square error (C++)
#' @inheritParams rmse_r
#' @param na_rm logical. Should missing values (including `NaN`) be removed?
#' @export
rmse2_cpp <- function(x, x0, na_rm = FALSE) {
  rmse2_cpp_(as.double(x), as.double(x0), na_rm = na_rm)
}

A benchmark of the base R versus C++ implementation is the following.

# create a list with 100 normal distributions with mean 0 and 1 million elements
# each
set.seed(123) # for reproducibility
x <- list()
for (i in 1:1e3) {
  x[[i]] <- rnorm(1e3)
}

# compute the mean of each distribution
x <- sapply(x, mean)

# remove some elements at random
x[sample(1:1e3, 1e2)] <- NA

rmse2_cpp(x, 0)
[1] NA
rmse2_cpp(x, 0, na_rm = TRUE)
[1] 0.02992032
mark(
  sqrt(mean((x - 0)^2, na.rm = TRUE)),
  rmse2_cpp(x, 0, na_rm = TRUE)
)
# A tibble: 2 × 6
  expression                            min  median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
1 sqrt(mean((x - 0)^2, na.rm = TRU…  8.37µs  9.36µs   100294.    30.4KB     60.2
2 rmse2_cpp(x, 0, na_rm = TRUE)     33.46µs  34.8µs    28194.        0B      0  

References